Abstract — Boosting has recently attracted great interest in the
machine learning community because of its impressive performance
on classification and regression problems. The success of boosting
algorithms may be interpreted in terms of the margin theory
[1]. Recently, it has been shown that the generalization error of
classifiers can be obtained by explicitly taking the margin 
distribution of the training data into account. Most of the current 
boosting algorithms in practice usually optimize a convex loss 
function and do not make use of the margin distribution. In 
this work we design a new boosting algorithm, termed margin- 
distribution boosting (MDBoost), which directly maximizes the 
average margin and minimizes the margin variance at the same 
time. This way the margin distribution is optimized. A totally- 
corrective optimization algorithm based on column generation 
is proposed to implement MDBoost. Experiments on various 
datasets show that MDBoost outperforms AdaBoost and LPBoost 
in most cases. 

Index Terms — boosting, AdaBoost, margin distribution, col- 
umn generation. 

I. Introduction 

Boosting offers a method for improving existing classifi- 
cation algorithms. Given a training dataset, boosting builds 
a strong classifier using only a weak learning algorithm [1],
[2]. Typically, a weak (or base) classifier generated by the
weak learning algorithm has a misclassification error that is
slightly better than random guessing. A strong classifier has a
much better test error. In this sense, boosting algorithms can 
boost the weak learning algorithm to obtain a much stronger 
classifier. Boosting was originally proposed as an ensemble 
learning method, which depends on majority voting of multiple 
individual classifiers. Later, Breiman [3] and Friedman et al.
[4] observed that many boosting algorithms can be viewed
as gradient descent optimization in functional space. Mason
et al. [5] developed AnyBoost for boosting arbitrary loss
functions with a similar idea. Despite the large success in 
practice of these boosting algorithms, there are still open 
questions about why and how boosting works. Inspired by 
the large-margin theory in kernel methods, Schapire et al. [1]
presented a margin-based bound for AdaBoost, which tries to
interpret AdaBoost's success with the margin theory. Although
the margin theory provides a qualitative explanation of the 
effectiveness of boosting, the bounds are quantitatively weak. 
A recent work [6] has proffered new tighter margin bounds,
which may be useful for quantitative predictions. Arc-Gv [3], a
variant of the AdaBoost algorithm, was designed by Breiman 
to empirically test AdaBoost's convergence properties. It is 
very similar to AdaBoost (only different in calculating the 
coefficient associated with each weak classifier) such that 
it increases margins even more aggressively than AdaBoost. 
Breiman's experiments on Arc-Gv show results contrary to the
margin theory: Arc-Gv always has a minimum margin that is
provably larger than AdaBoost's, but Arc-Gv performs worse in
terms of test error [3]. Grove and Schuurmans [7] observed the
same phenomenon. In the literature, much work has focused on
maximizing the minimum margin [8]-[10]. Recently, Reyzin
and Schapire [11] re-ran Breiman's experiments by controlling
weak classifiers' complexity. They found that a better margin 
distribution is more important than the minimum margin. It 
is of importance to have a large minimum margin, but not 
at the expense of other factors. They thus conjectured that 
maximizing the average margin rather than the minimum 
margin may lead to improved boosting algorithms. We try to 
verify this conjecture in this work. 

Recently, Garg and Roth [12] introduced a margin distribution
based complexity measure for learning classifiers and
developed margin distribution based generalization bounds.
Competitive classification results have been shown by optimizing
this bound. Another relevant work is [13], which applies
a boosting method to optimize the margin distribution based
generalization bound obtained by [14]. Experiments show that
the new boosting methods achieve considerable improvements
over AdaBoost. The optimization of this new boosting method
is based on the AnyBoost framework [5]. Aligned with these
attempts, we propose a new boosting algorithm through opti- 
mization of margin distribution (termed MDBoost). Instead of 
minimizing a margin distribution based generalization bound, 
we directly optimize the margin distribution: maximizing the 
average margin and at the same time minimizing the variance 
of the margin distribution. 

The theoretical justification of the proposed MDBoost is 
that, approximately, AdaBoost actually maximizes the average 
margin and minimizes the margin variance. 

The main contributions of our work are as follows. 

1) We propose a new totally-corrective boosting algorithm,
MDBoost, by optimizing the margin distribution di- 
rectly. The optimization procedure of MDBoost is based 
on the idea of column generation that has been widely 
used in large-scale linear programming. 
2) We empirically demonstrate that MDBoost outperforms
AdaBoost and LPBoost on most UCI datasets used in
our experiments. The success of MDBoost verifies the
conjecture in [11]. Our results also show that MDBoost
achieves similar (or better) classification performance
compared with AdaBoost-CG [15]. AdaBoost-CG
is also totally-corrective in the sense that all the
linear coefficients of the weak classifiers are updated
during the training. An advantage of MDBoost is that,
at each iteration, MDBoost solves a quadratic program
while AdaBoost-CG needs to solve a general convex
program.¹

Throughout the paper, a matrix is denoted by an upper-case
letter ($X$); a column vector is denoted by a bold lower-case
letter ($\mathbf{x}$). The $i$th row of $X$ is denoted by $X_{i:}$ and the $i$th
column by $X_{:i}$. We use $I$ to denote the identity matrix. $\mathbf{1}$ and
$\mathbf{0}$ are column vectors of 1's and 0's, respectively. Their sizes will
be clear from the context. We use $\succcurlyeq$, $\preccurlyeq$ to denote
component-wise inequalities.

The rest of the paper is structured as follows. In Section
II we present the main idea. In Section III the dual of
MDBoost's optimization problem is derived, which enables us
to design an LPBoost-like column generation based boosting
algorithm. We provide an experimental comparison of the
algorithms on UCI data in Section IV and conclude the paper
in Section V.



II. Algorithms 

Before we present our main results, we introduce some
preliminary concepts. Let $\{(x_i, y_i)\}_{i=1,\dots,M}$ be the set of
training data, where $x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$, $\forall i$. Let
$h(\cdot) \in \mathcal{H}$ be a base/weak classifier that projects an input vector
$x$ into $[-1, +1]$. We assume that the set $\mathcal{H}$ is finite and we
have $N$ possible weak classifiers. Let the matrix $H \in \mathbb{R}^{M \times N}$,
where the $(i, j)$ entry of $H$ is $H_{ij} = h_j(x_i)$. $H_{ij}$ is the label
predicted by weak classifier $h_j(\cdot)$ on the training datum $x_i$.
Therefore each column $H_{:j}$ of the matrix $H$ consists of the
output of weak classifier $h_j(\cdot)$ on all the training data, while
each row $H_{i:}$ contains the outputs of all weak classifiers on
the training datum $x_i$.
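As a concrete illustration of the matrix $H$, the following minimal sketch builds it for a toy dataset using decision stumps as weak classifiers. The helper names `make_stump` and `build_H` and the toy data are hypothetical and chosen only for this example.

```python
from typing import Callable, List, Sequence

Stump = Callable[[Sequence[float]], int]


def make_stump(feature: int, threshold: float, sign: int = 1) -> Stump:
    """Return h(x) = sign if x[feature] > threshold, else -sign."""
    def h(x: Sequence[float]) -> int:
        return sign if x[feature] > threshold else -sign
    return h


def build_H(X: List[Sequence[float]], stumps: List[Stump]) -> List[List[int]]:
    """H[i][j] = h_j(x_i): rows index training data, columns index weak classifiers."""
    return [[h(x) for h in stumps] for x in X]


# Toy data: M = 3 points in 2D, N = 3 candidate stumps (illustrative values).
X = [[0.0, 1.0], [1.0, 0.0], [2.0, 3.0]]
stumps = [make_stump(0, 0.5), make_stump(1, 0.5), make_stump(0, 1.5, sign=-1)]
H = build_H(X, stumps)

# Column j holds the outputs of weak classifier j on all training data.
column_0 = [row[0] for row in H]
```

Each row of `H` corresponds to one training datum and each column to one weak classifier, matching the definitions of $H_{i:}$ and $H_{:j}$ above.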

Boosting is a typical example of ensemble learning, where 
multiple learners are trained to solve a single classification 
problem. A boosting algorithm creates a strong learner by 
incrementally adding weak learners to the final strong learner 
Q. The weak learner has an important impact on the strong 
learner. In general, a boosting algorithm builds on a
user-specified base learning procedure and runs it repeatedly on
modified data that are outputs from the previous iterations.
The final output strong classifier takes the form
$F(x) = \sum_{j=1}^{N} w_j h_j(x)$ with $w_j \geq 0$, $j = 1, \dots, N$.

The following theorem serves as the basis of the proposed 
MDBoost algorithm. 



¹More precisely, it is a constrained entropy maximization problem. To
date, unlike quadratic programming, which is a well-studied optimization
problem, there are no specialized solvers for the constrained entropy
optimization problem.



Theorem 2.1. AdaBoost maximizes the unnormalized average 
margin and simultaneously minimizes the variance of the mar- 
gin distribution under the assumption that the margin follows a 
Gaussian distribution. 

Proof: See appendix. ■ 
The key assumption that makes this theorem valid is that the 
weak learners generated by AdaBoost make independent errors 
over the training dataset. This assumption may not be true in 
practice, but could be a plausible approximation. 

Mathematically, the above theorem can be formulated as:

$$\max_{w} \ \bar{\rho} - \tfrac{1}{2} \sigma^2, \quad \text{s.t. } w \succcurlyeq 0, \ \mathbf{1}^\top w = D, \tag{1}$$



where $\sigma^2$ is the unnormalized margin variance and $\bar{\rho}$ is the
unnormalized average margin. Let $\rho_i$ denote the unnormalized
margin for the $i$th example datum, i.e.,

$$\rho_i = y_i H_{i:} w, \quad \forall i = 1, \dots, M. \tag{2}$$



In the above equations, $w$ is the vector of linear coefficients that
weight the weak classifiers. $D$ is the sum of these linear
coefficients, which needs to be determined by cross-validation.
Note that $D$ is actually a trade-off parameter, which balances
the normalized average margin and the normalized margin
variance. The empirical margin variance can be computed
as $\sigma^2 = \frac{1}{M(M-1)} \sum_{i>j} (\rho_i - \rho_j)^2$. So we explicitly write the
optimization in $\rho$:

$$\min_{w} \ \frac{1}{2(M-1)} \sum_{i>j} (\rho_i - \rho_j)^2 - \sum_{i=1}^{M} \rho_i, \quad \text{s.t. } w \succcurlyeq 0, \ \mathbf{1}^\top w = D. \tag{3}$$
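The quantities above can be checked numerically. The sketch below computes the unnormalized margins of (2), evaluates the pairwise form of the empirical margin variance, and confirms that it coincides with the usual unbiased sample variance. The toy values of `H`, `y` and `w` are illustrative assumptions, not data from the experiments.

```python
from statistics import mean, variance


def margins(H, y, w):
    """Unnormalized margins rho_i = y_i * (H_i: w), as in (2)."""
    return [yi * sum(hij * wj for hij, wj in zip(row, w))
            for row, yi in zip(H, y)]


def pairwise_variance(rho):
    """sigma^2 = 1 / (M (M - 1)) * sum over pairs i > j of (rho_i - rho_j)^2."""
    M = len(rho)
    total = sum((rho[i] - rho[j]) ** 2 for i in range(M) for j in range(i))
    return total / (M * (M - 1))


def objective_1(rho):
    """Objective of (1): average margin minus half the margin variance."""
    return mean(rho) - 0.5 * pairwise_variance(rho)


H = [[-1, 1, 1], [1, -1, 1], [1, 1, -1]]  # toy weak-classifier outputs
y = [1, 1, -1]                            # toy labels
w = [0.5, 0.25, 0.25]                     # nonnegative, sums to D = 1

rho = margins(H, y, w)
sigma2 = pairwise_variance(rho)
```

Multiplying the objective of (1) by $M$ and negating it gives the cost in (3), which is why both problems share the same minimizer.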

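If we normalize the margin by setting $\mathbf{1}^\top w = 1$, the above
problem is also equivalent to

$$\min_{w} \ \frac{D}{2(M-1)} \sum_{i>j} (\rho_i - \rho_j)^2 - \sum_{i=1}^{M} \rho_i, \quad \text{s.t. } w \succcurlyeq 0, \ \mathbf{1}^\top w = 1, \tag{4}$$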



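where now $\rho$ is the normalized margin. From this formulation,
it is easy to see that $D$ balances the two terms in the cost
function. Both problems are equivalent to (1). We define a
matrix $A \in \mathbb{R}^{M \times M}$:

$$A = \begin{bmatrix} 1 & -\frac{1}{M-1} & \cdots & -\frac{1}{M-1} \\ -\frac{1}{M-1} & 1 & \cdots & -\frac{1}{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{M-1} & -\frac{1}{M-1} & \cdots & 1 \end{bmatrix}.$$
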

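Our optimization problem can be rewritten into a simplified
version:

$$\min_{w, \rho} \ \tfrac{1}{2} \rho^\top A \rho - \mathbf{1}^\top \rho, \quad \text{s.t. } w \succcurlyeq 0, \ \mathbf{1}^\top w = D, \ \rho_i = y_i H_{i:} w, \ \forall i = 1, \dots, M. \tag{5}$$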

It is easy to see that $A$ is positive semidefinite.² So (5) is a
convex quadratic program (QP) in $\rho$.

²$A$ is not strictly positive definite. Since each row of $A$ sums to zero,
one of $A$'s eigenvalues is zero.
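The role of $A$ can be verified on a small example. The following sketch builds $A$ for $M = 3$, checks that $\tfrac{1}{2} \rho^\top A \rho$ reproduces the pairwise term of (3), and confirms that every row sums to zero, so the all-ones vector lies in the null space. The function names and the test vector are illustrative.

```python
def build_A(M):
    """A has ones on the diagonal and -1/(M-1) everywhere else."""
    off = -1.0 / (M - 1)
    return [[1.0 if i == j else off for j in range(M)] for i in range(M)]


def quad_form(A, rho):
    """Return rho^T A rho."""
    n = len(rho)
    return sum(rho[i] * A[i][j] * rho[j] for i in range(n) for j in range(n))


def pairwise_term(rho):
    """First term of (3): 1 / (2 (M - 1)) * sum_{i>j} (rho_i - rho_j)^2."""
    M = len(rho)
    total = sum((rho[i] - rho[j]) ** 2 for i in range(M) for j in range(i))
    return total / (2 * (M - 1))


rho = [0.0, 0.5, -0.5]
A = build_A(len(rho))

half_quad = 0.5 * quad_form(A, rho)       # equals pairwise_term(rho)
row_sums = [sum(row) for row in A]        # all zero
ones_energy = quad_form(A, [1.0] * 3)     # zero: A is only semidefinite
```

Because a constant shift of all margins leaves the variance unchanged, the quadratic form vanishes along the all-ones direction, which is the zero eigenvalue mentioned in the footnote.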
If we could access all the weak classifiers (that is, if the entire
matrix $H$ were known), we could solve problem (5) using
off-the-shelf QP solvers [16]. However, in many cases, we do
not know $H$ beforehand simply because the size of the
weak classifier set could be prohibitively (or even infinitely)
large. As in LPBoost [10], column generation can be used
to attack this problem. Column generation was first proposed
by [17] for solving some specially structured linear programs
with an extremely large number of variables. A comprehensive
survey on this technique is [18]. The general idea of column
generation is that, instead of solving the original large-scale
problem (master problem), one works on a restricted master
problem with a reasonably small subset of variables at each
step. The dual of the restricted master problem is solved
by conventional convex programming, and the optimal dual
solution is used to find the new variable to be included into
the restricted master problem. LPBoost is a direct application
of column generation in boosting. For the first time, LPBoost
shows that in a linear program framework, unknown weak
hypotheses can be learned from the dual although the space
of all weak hypotheses is infinitely large. This is the highlight
of LPBoost. This idea can be generalized to solve convex
programs other than linear programming problems.³ We next
derive the dual of (5) such that a column generation based
optimization procedure can be devised.

III. The Dual of MDBoost

The dual of a convex program always reveals some mean- 
ingful properties of the problem. We show that MDBoost is 
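in fact a regularized version of LPBoost. The Lagrangian of
(5) is

$$L(\underbrace{w, \rho}_{\text{primal}}, \underbrace{u, r, q}_{\text{dual}}) = \tfrac{1}{2} \rho^\top A \rho - \mathbf{1}^\top \rho + r (\mathbf{1}^\top w - D) - q^\top w + \sum_{i=1}^{M} u_i (\rho_i - y_i H_{i:} w), \tag{6}$$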



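with $q \succcurlyeq 0$. The infimum of $L$ w.r.t. the primal variables is

$$\inf_{w, \rho} L = \inf_{\rho} \left[ \tfrac{1}{2} \rho^\top A \rho + (u - \mathbf{1})^\top \rho \right] + \inf_{w} \left[ \left( r \mathbf{1}^\top - q^\top - \sum_{i=1}^{M} u_i y_i H_{i:} \right) w \right] - D r. \tag{7}$$

Clearly, $r \mathbf{1}^\top - q^\top - \sum_{i=1}^{M} u_i y_i H_{i:} = 0$ must hold in order to
have a finite infimum. Therefore, we have

$$\sum_{i=1}^{M} u_i y_i H_{i:} \preccurlyeq r \mathbf{1}^\top. \tag{8}$$

For the first term in $L$, its gradient must vanish at the optimum:

$$\frac{\partial L}{\partial \rho_i} = 0, \quad \forall i. \tag{9}$$

This leads to $\rho = -A^{-1} (u - \mathbf{1})$; and the infimum is
$-\tfrac{1}{2} (u - \mathbf{1})^\top A^{-1} (u - \mathbf{1})$.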
By putting the above results together, the dual is 



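$$\max_{r, u} \ -D r - \tfrac{1}{2} (u - \mathbf{1})^\top A^{-1} (u - \mathbf{1}), \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{i:} \preccurlyeq r \mathbf{1}^\top. \tag{10}$$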



³Nevertheless, for linear programs, the optimal solution always lies at a
vertex and column generation solves the program exactly. For general
large-scale convex programs, only an approximate solution can be found.



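We can reformulate (10) as

$$\min_{r, u} \ r + \frac{1}{2D} (u - \mathbf{1})^\top A^{-1} (u - \mathbf{1}), \quad \text{s.t. } \sum_{i=1}^{M} u_i y_i H_{i:} \preccurlyeq r \mathbf{1}^\top. \tag{11}$$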



Under some mild conditions, weak duality and strong duality
hold between the primal and dual problems we have derived.
By strong duality, the two problems are equivalent. The
solution of the dual gives the solution to the primal.

Note that it is critically important to keep two variables
$w$ and $\rho$ to arrive at the dual (11). One may obtain a
different formulation otherwise, and no column generation
based optimization can be obtained.

In our case, $A$ is positive semidefinite but not strictly positive
definite, and its inverse does not exist. We can replace its
inverse $A^{-1}$ with the Moore-Penrose pseudo-inverse $A^{\dagger}$. In
our experiments, we have regularized $A$ by replacing it with $A + \delta I$,
where $I$ is the identity matrix and $\delta$ is a small constant.
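For this particular $A$, the regularized inverse has a closed form. Writing $a = M/(M-1)$, we have $A + \delta I = (a + \delta) I - \frac{a}{M} \mathbf{1}\mathbf{1}^\top$, and the Sherman-Morrison formula gives $(A + \delta I)^{-1} = \frac{1}{a + \delta} \left( I + \frac{a}{M \delta} \mathbf{1}\mathbf{1}^\top \right)$. The sketch below checks this identity numerically; it is a side calculation under the stated definition of $A$, not a procedure described in the source, and the function names are hypothetical.

```python
def regularized_A(M, delta):
    """A + delta * I, with A having unit diagonal and -1/(M-1) elsewhere."""
    off = -1.0 / (M - 1)
    return [[1.0 + delta if i == j else off for j in range(M)]
            for i in range(M)]


def regularized_inverse(M, delta):
    """Closed form (1/(a+delta)) * (I + a/(M*delta) * 1 1^T), a = M/(M-1)."""
    a = M / (M - 1)
    scale = 1.0 / (a + delta)
    rank_one = a / (M * delta)
    return [[scale * ((1.0 if i == j else 0.0) + rank_one) for j in range(M)]
            for i in range(M)]


def matmul(P, Q):
    """Plain dense matrix product for small illustrative matrices."""
    n, m, p = len(P), len(Q), len(Q[0])
    return [[sum(P[i][k] * Q[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


M, delta = 4, 1e-3
product = matmul(regularized_A(M, delta), regularized_inverse(M, delta))
```

The $\frac{a}{M \delta}$ factor shows that a small $\delta$ strongly penalizes any component of $u - \mathbf{1}$ along the all-ones direction, which is the direction in which $A$ itself is singular.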

It is now clear that the dual problem (11) is a regularized
hard-margin LPBoost. The second term in the cost function
regularizes the dual variable $u$. For example, when $A$ is
the identity matrix, this regularization term encourages $u$ to
approach $\mathbf{1}$. Also note that here $u$ can take any value and it
is not a distribution any more. In contrast, in AdaBoost and
LPBoost, $u$ is a distribution: $u \succcurlyeq 0$ and $\mathbf{1}^\top u = 1$.

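The Bayesian interpretation of norm-based regularization
is as follows. The $\ell_2$-norm assumes a Gaussian prior probability
over the parameter, and the $\ell_1$-norm assumes a Laplacian prior
probability. If we view the regularization term as the (negative) log of
the probability for the parameter $x$, we have, up to scaling and additive constants,

$$-\log p(x) \simeq \begin{cases} x^\top A^{-1} x, & \text{if } p(x) = \mathcal{G}(0, A), \\ \|x\|_1, & \text{if } p(x) \propto \prod_i \exp(-|x_i|), \end{cases} \tag{12}$$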



where $\mathcal{G}(0, A)$ is a Gaussian distribution with zero mean
and covariance $A$. In practice, a zero-mean and unit-variance
Gaussian prior is usually assumed for kernel ridge regression, 
while a Laplacian prior over coefficients is typically used in 
sparse coding and compressed sensing. 

In our case, when the number of training data is large
($M \gg 1$), $A$ can be approximated by the identity matrix.
The regularization term is simply the variance of the weights 
associated with each datum. Intuitively, one can design A 
which contains useful prior information for some particular 
purpose. 

A. Column Generation Based Optimization 

With the above analysis, a column generation based
technique can now be used to solve problem (5).

Instead of solving (5) directly, one calculates the most
violated constraint in (11) iteratively for the current solution
and adds this constraint to the optimization problem. In theory,
any column that violates dual feasibility can be added. To
speed up the convergence, we add the most violated constraint
by solving the following problem:

$$h'(\cdot) = \operatorname*{argmax}_{h(\cdot)} \ \sum_{i=1}^{M} u_i y_i h(x_i). \tag{13}$$



This is actually the same as the one that standard AdaBoost
and LPBoost use for producing the best weak classifier. That is
Algorithm 1: Column generation based MDBoost.

Input: Labeled training data $(x_i, y_i)$, $i = 1, \dots, M$; termination
threshold $\epsilon > 0$; regularization parameter $D$; maximum
number of iterations $N_{\max}$.

Initialization: $N = 0$; $w = 0$; and $u_i = \frac{1}{M}$, $i = 1, \dots, M$.

for iteration = 1 : $N_{\max}$ do

1) Obtain a new base $h'(\cdot)$ by solving (13);

2) Check for optimal solution:
if iteration > 1 and $\sum_{i=1}^{M} u_i y_i h'(x_i) < r + \epsilon$,
then break; the problem is solved;

3) Add $h'(\cdot)$ to the restricted master problem, which
corresponds to a new constraint in the dual;

4) Solve the dual problem (11) and update $r$ and $u_i$
($i = 1, \dots, M$).

5) Count weak classifiers $N = N + 1$.

end

Output:

1) Compute the primal variable $w$ from the optimality
conditions and the last solved dual problem (primal-dual
interior point methods output $w$ as well);

2) The final strong classifier is $F(x) = \sum_{j=1}^{N} w_j h_j(x)$.



to say, to find the weak classifier that has minimum weighted
training error. We summarize our MDBoost in Algorithm 1.

The convergence of Algorithm 1 is guaranteed by general
column generation or cutting-plane arguments and is easy
to establish. When a new $h'(\cdot)$ that violates dual feasibility
is added, the new optimal value of the dual problem (a
maximization) decreases. Accordingly, the optimal value of
its primal problem decreases too because they have the same
optimal value due to the zero duality gap. Moreover, the primal
cost function is convex, therefore in the end it converges to
the global minimum. Note that in the last step of the proposed
MDBoost algorithm, the value of $w$ can be obtained easily.
Primal-dual interior-point (PD-IP) methods work on the primal
and dual problems simultaneously and therefore both primal
and dual variables are available after convergence. MDBoost is
totally-corrective in the sense that the coefficients of all weak
classifiers are updated at each iteration.
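The loop of Algorithm 1 can be sketched in a few lines of dependency-free Python. This is a minimal illustration under several assumptions, not the implementation used in the experiments: the restricted master problem is solved in its primal form (5) by projected gradient descent instead of a primal-dual interior-point solver on (11), the dual variable $u$ is recovered from $u = \mathbf{1} - A \rho$, and $r$ is taken as the largest edge among the columns added so far. All function names and the toy data are hypothetical.

```python
from itertools import product


def a_times(rho):
    """Compute A rho for A = (M I - 1 1^T) / (M - 1) without forming A."""
    M = len(rho)
    avg = sum(rho) / M
    return [M / (M - 1) * (r - avg) for r in rho]


def project_simplex(v, D):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = D}."""
    cum, theta = 0.0, 0.0
    for k, s in enumerate(sorted(v, reverse=True), start=1):
        cum += s
        if s - (cum - D) / k > 0:
            theta = (cum - D) / k
    return [max(x - theta, 0.0) for x in v]


def solve_restricted(cols, D, inner_iters=500):
    """Approximately solve (5) on the current columns; cols[j][i] = y_i h_j(x_i)."""
    n, M = len(cols), len(cols[0])
    w = [D / n] * n
    step = (M - 1) / (M * M * n)  # conservative step size, at most 1/L

    def duals(w):
        rho = [sum(cols[j][i] * w[j] for j in range(n)) for i in range(M)]
        u = [1.0 - a for a in a_times(rho)]
        edges = [sum(c[i] * u[i] for i in range(M)) for c in cols]
        return u, edges

    for _ in range(inner_iters):
        u, edges = duals(w)
        # The gradient of the primal cost w.r.t. w_j is -edges[j].
        w = project_simplex([wj + step * e for wj, e in zip(w, edges)], D)
    u, edges = duals(w)
    return w, u, max(edges)


def mdboost(X, y, candidates, D, eps=1e-6, n_max=20):
    """Column generation loop following the steps of Algorithm 1."""
    M = len(y)
    u, r = [1.0 / M] * M, 0.0
    chosen, cols, w = [], [], []
    for it in range(1, n_max + 1):
        def edge(h):
            return sum(u[i] * y[i] * h(X[i]) for i in range(M))
        best = max(candidates, key=edge)         # step 1: solve (13)
        if it > 1 and edge(best) < r + eps:      # step 2: optimality check
            break
        chosen.append(best)                      # step 3: add the column
        cols.append([y[i] * best(X[i]) for i in range(M)])
        w, u, r = solve_restricted(cols, D)      # step 4: update w, u, r
    return chosen, w


def stump(feature, threshold, sign):
    return lambda x: sign if x[feature] > threshold else -sign


X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, 1, 1]
candidates = [stump(0, t, s) for t, s in product([0.5, 1.5, 2.5], [1, -1])]
chosen, w = mdboost(X, y, candidates, D=2.0)
F = lambda x: sum(wj * h(x) for wj, h in zip(w, chosen))
```

On this separable toy set a single stump already yields equal margins, so $u$ becomes the all-ones vector and no candidate violates the dual constraint in the second iteration. A production implementation would replace `solve_restricted` with an exact QP solver.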

IV. Experiments 

In this section, we run experiments to show the effectiveness 
of the proposed MDBoost algorithm. In order to control the
complexity of the classifier, we use decision stumps as weak
classifiers.

We first show some results on a synthetic dataset. 800
2D points are generated as shown in Fig. 1 (top). 60% are
used for training and the remainder for testing. We then run
AdaBoost (1000 iterations) and MDBoost ($N_{\max} = 1000$) on
this dataset. The cumulative margin distribution is plotted in
Fig. 1 (middle). We have set the parameter $D$ to the sum of the
weak classifiers' coefficients of AdaBoost. In this experiment,
MDBoost's average margin is very similar to AdaBoost's
average margin (both are 0.9). However, as observed, the
variance of MDBoost is smaller than that of AdaBoost (0.027
vs. 0.039). MDBoost also performs slightly better than
AdaBoost (3.8% vs. 5.0% in test error). Note that, in terms of
the minimum margin, AdaBoost is better than MDBoost. This
confirms that the minimum margin is not a direct measure of
test error. In Fig. 1 (bottom) we show the normalized value of
$w$ of the final selected weak classifiers for both algorithms. It
can be seen that both algorithms select very similar decision
stumps. However, the weights could be different. In Fig. 1
(bottom), the x-axis is the index of all the 2500 candidate
weak classifiers. If a weak classifier is not selected, then its
corresponding weight (y-axis) is zero.

Secondly, in order to provide a clearer insight into the
feature selection of AdaBoost and MDBoost, we run
AdaBoost and MDBoost on the UCI dataset spam, whose
features have explicit meanings. The task is to separate spam
emails based on word frequencies. The number of training iterations
is kept below 60 for both algorithms, which is close to the dimension
of the feature space. We repeat the experiments 20 times and
record the frequency of each feature (word) being selected
by the boosting algorithms. The average frequencies over 20
rounds are shown as a histogram in Fig. 2. Note that the
parameter $D$ of MDBoost is chosen by cross validation (candidates are {2, 5,
8, 10, 12, 15, 20, 30, 40, 50, 70, 90, 100, 120}).
For AdaBoost, since the best test error is achieved before 60
iterations and no over-fitting is observed during training, the
classifier obtained at iteration 60 is considered as optimal.
As is illustrated in Fig. 2, both algorithms select important
features such as "free" (feature #16 on the plot), "hp" (25),
and "$" (53) with high frequencies. However, for the other
features, the two algorithms show different preferences.
MDBoost tends to select features like "address" (15),
"order" (9) and "000" (23), which are intuitively helpful for the
classification. In contrast, the features favored by AdaBoost,
such as "report" (14), "email" (18) and "conference" (48),
are less relevant for spam email detection. The fact that
MDBoost has a smaller average test error (11.3% ± 1.20%) than
AdaBoost (12.2% ± 1.55%) supports our analysis.

In the third experiment, we run MDBoost on real datasets
and we focus on comparing test error. We have compared
four boosting algorithms, which are standard AdaBoost,
soft-margin LPBoost [10], AdaBoost-CG⁴ and MDBoost, respectively.

The cross validation values for the parameter $D$ for MDBoost
and AdaBoost-CG are {2, 5, 8, 10, 12, 15, 20,
30, 40, 50, 70, 90, 100, 120}. The values of the trade-off parameter $C$
for LPBoost [10] are {0.001, 0.002, 0.005, 0.007, 0.01,
0.02, 0.03, 0.05, 0.07, 0.1, 0.15, 0.2, 0.3, 0.4, 0.5}. The
experiments are run on the 13 UCI benchmark datasets from
[19].⁵ Generally, we randomly split the dataset into 3 subsets.
60% of the examples are used for training; 20% are used for
cross validation and the other 20% are used for test. For the
large datasets (ringnorm, twonorm and waveform), due to
the large size, we randomly select 10% for training, 30% for
cross validation and 60% for test. For these 3 large datasets,
we repeat the experiments 10 times.
For the other 10 datasets, experiments are run
50 times.

The convergence threshold $\epsilon$ for LPBoost, AdaBoost-CG

⁴AdaBoost-CG is a totally corrective version of AdaBoost. It solves
$\min_{w} \log \left( \sum_{i=1}^{M} \exp(-y_i H_{i:} w) \right)$, s.t. $w \succcurlyeq 0$, $\mathbf{1}^\top w = D$ using column
generation. See [15] for details.

⁵http://ida.first.fraunhofer.de/projects/bench/

Fig. 1. Toy data: (top) the data; (middle) the cumulative frequency of margin 
distributions; (bottom) the normalized value of w of the final learned weak 
classifiers. 



and MDBoost are all set to 10^^. Both test and training results 
for the four compared algorithms are reported in Table I for
a maximum number of iterations of 100, 500 and 1000. In
some cases, the three totally-corrective boosting algorithms
(LPBoost, AdaBoost-CG, MDBoost) converge earlier than
100 iterations. We simply copy the converged results to
iterations 500 and 1000 as reported in Table I.

As can be seen, in terms of training error, soft-margin
LPBoost converges fastest during training.
It finishes the column generation
procedure within 100 rounds for 12 datasets but only beats the other
algorithms on training error for one dataset (banana). On the
other hand, the standard AdaBoost, because of its coordinate
descent optimization strategy, converges slowest on all the
datasets but ranks the best on training error for 10 datasets.
MDBoost ties with LPBoost in the training error comparison,
while AdaBoost-CG is the best on 2 datasets.



In terms of test error, the proposed MDBoost is the best
on most datasets (9 among 13) and could be considered the
best algorithm with respect to the generalization error. The
quantitative analysis of the superiority of MDBoost will be
discussed later. AdaBoost-CG has the best performance on
3 datasets. The standard AdaBoost wins the remaining one
(thyroid). It is surprising to find that LPBoost performs
slightly worse than the other algorithms on all datasets. It
has been observed that different LP solvers may result in
slightly different performances on test data for LPBoost [9].
On some datasets, there is a significant difference between
PD-IP and simplex based solvers in terms of iterations and the
final selected weak classifiers. Here we have used Mosek [20],
which implements PD-IP methods. Experiments with simplex
LP solvers are needed to verify the LPBoost results. We
leave this as future work. In summary, the proposed MDBoost
algorithm shows competitive classification performance compared with
AdaBoost, LPBoost and AdaBoost-CG. This validates the
usefulness of optimizing margin distributions.

In terms of computational complexity, at each iteration, 
MDBoost needs to solve a convex QP. The complexity of 
solving a QP is slightly worse than solving an LP, and it is 
still very efficient. Moreover, those techniques developed for 
solving large-scale support vector machines may be applicable 
here. AdaBoost-CG needs to solve a general convex problem
at each iteration, which is much slower than solving a QP or
LP [15].

In order to verify the superior classification performance
of the proposed MDBoost quantitatively, we carry out three
statistical comparisons, namely the Wilcoxon signed-rank test,
the Friedman test and the Bonferroni-Dunn test [21], on
the experimental performances of the 4 compared boosting
algorithms.

The Wilcoxon signed-ranks test (WSRT) [21] is a
non-parametric alternative to the paired t-test, which ranks the
difference in performance of two classifiers for each dataset.
Here the WSRT is used for comparing MDBoost with the other
3 boosting approaches in terms of classification performance.
The null-hypothesis states that the classifier concerned is
no better than the other algorithm in performance.
Consequently, it is a one-tailed test with a conventional confidence level
(of 95% in this work). Further details of the WSRT are given
in Table II. As shown, the null-hypothesis is rejected in the
tests of MDBoost vs. AdaBoost, MDBoost vs. AdaBoost-CG
and MDBoost vs. LPBoost. In other words, MDBoost is
considered superior to the other 3 boosting algorithms with
respect to generalization error. AdaBoost-CG is the second
best since it defeats AdaBoost and LPBoost in the test.
AdaBoost and LPBoost could not be considered better than
any other algorithm. To conclude, the WSRT indicates that, with
a confidence level of 95%, MDBoost is the best classifier on
the 13 datasets used.

The Friedman test (FT) is a non-parametric equivalent of the
repeated-measures ANOVA (Analysis of Variance) [21]. FT
can measure the difference between more than two sets of
classification results. If the null-hypothesis, which assumes
that all the performances are similar to each other, is rejected,
a post-hoc test is performed to compare the algorithms
Fig. 2. The frequencies of different features being selected on the spam dataset. Both algorithms select important features such as "free", "hp", and "$"
with high frequencies.



pairwise. The Bonferroni-Dunn test (BDT) [21] is then
adopted as our post-hoc test to verify whether a classifier
outperforms the others under multiple comparisons.
Not surprisingly, FT rejects its null-hypothesis,
which means the performances of the 4 boosting approaches
are essentially different. The confidence level for this test is
also set to 95%.⁶ We then run BDT. The results of BDT
are reported in Table III. According to BDT, different from
WSRT, only MDBoost statistically significantly outperforms
both AdaBoost and LPBoost in test error. Also, AdaBoost-CG
is not significantly better than AdaBoost with this FT
comparison. The conclusion that we can draw here is: with a
confidence level of 95%, (1) MDBoost outperforms AdaBoost
and LPBoost; (2) MDBoost and AdaBoost-CG are not
significantly different statistically.

To take a close look at the convergence behavior of MDBoost,
we plot the training and test error of AdaBoost and
MDBoost for 3 datasets in Fig. 3. Typically, MDBoost
converges faster than AdaBoost due to its totally-corrective update
rule. In terms of test error, MDBoost is better in most cases
as we have reported. On the breast-cancer dataset, clearly 
AdaBoost over-fits the training data. Also both MDBoost and 
AdaBoost plateau before reaching zero on banana because at 
some point in the algorithms, decision stumps are not able to 
provide better error rates anymore. 

V. Conclusion 

In this paper, we have proposed a new boosting method
that optimizes the margin distribution by maximizing the
average margin and at the same time minimizing the margin
variance. Inspired by LPBoost [10], a column generation based
optimization algorithm is proposed to implement this idea.

The proposed MDBoost inherits LPBoost's advantages such
as well-defined convergence criteria, fast convergence rates
and fewer weak learners in the final strong classifier.

Our experiments on various datasets show that MDBoost 
outperforms AdaBoost and LPBoost; and is at least equivalent 

⁶In this case, the critical value equals 2.291 with the number of
classifiers being 4.



to (if not better than) AdaBoost-CG in terms of classification 
accuracy. 

A future research direction is how to integrate useful prior 
information into the matrix A. We believe that improved 
performance may be obtained by carefully designing A. For 
example, one can take asymmetric data distribution into
consideration by giving a weight to each margin $\rho_i$, $i = 1, \dots, M$.
We also want to explore the robustness of MDBoost. Since
MDBoost considers the whole margin distribution, it is
expected to be more robust to outliers. More experiments are
required to test this issue.

Appendix 



The proof of Theorem 2.1 can be found in [15]. For
self-completeness, we include the main sketch of the proof here.
A lemma is needed first.

Lemma 5.1. The margin of AdaBoost follows the Gaussian 
distribution. In general, the larger the number of weak clas- 
sifiers, the more closely does the margin follow the form of 
Gaussian under the assumption that selected weak classifiers 
are uncorrelated. 

We omit the complete proof for this lemma here. The main 
tool is the central limit theorem. A condition for applying 
the central limit theorem is that the N variables must be 
independent. In our case, loosely speaking, AdaBoost selects 
independent weak classifiers such that each weak classifier 
makes different errors on the training dataset. This may not 
always hold in practice, but can be a reasonable approximation. 
So approximately we can view the selected weak classifiers
as uncorrelated.

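We know that the cost function that AdaBoost minimizes is

$$f(w) = \log \left( \sum_{i=1}^{M} \exp(-\rho_i) \right), \tag{14}$$

where $\rho_i$ is the unnormalized margin for datum $x_i$. As shown in
the above lemma, $\rho_i$ follows a Gaussian

$$\mathcal{G}(\rho; \bar{\rho}, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp \left( -\frac{(\rho - \bar{\rho})^2}{2\sigma^2} \right),$$

with mean $\bar{\rho}$ and variance $\sigma^2$.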
SHEN AND LI: BOOSTING THROUGH OPTIMIZATION OF MARGIN DISTRIBUTIONS 






TABLE I 

Test and training errors (in percentage %) of AdaBoost (AB), AdaBoost-CG (AB-CG), LPBoost (LP) and MDBoost (MD). For
datasets twonorm, ringnorm and waveform, we run the experiments 10 times due to the datasets' large sizes. For all the others,
experiments are run 50 times. The mean and standard deviation are reported. We have used decision stumps as weak
classifiers. In most cases, MDBoost outperforms AdaBoost and LPBoost.
It is well known that the Monte Carlo integration method
can be used to approximate a continuous integral

$$\int g(x) f(x)\, dx \approx \frac{1}{K} \sum_{k=1}^{K} f(x_k), \qquad (15)$$

where $g(x)$ is a probability distribution such that $\int g(x)\, dx = 1$,
$f(x)$ is an arbitrary function, and $x_k$ ($k = 1, \dots, K$) are
randomly sampled from the distribution $g(x)$. The more samples
are used, the more accurate the approximation is.
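The approximation (15) can be sketched in a few lines. The choices below are illustrative assumptions, not taken from the paper: $g$ is the standard normal density and $f(x) = x^2$, so the exact value of the integral is 1.

```python
import random


def monte_carlo_expectation(f, sampler, num_samples):
    """Approximate the integral of g(x) f(x) dx by averaging f over draws from g."""
    return sum(f(sampler()) for _ in range(num_samples)) / num_samples


rng = random.Random(0)


def draw():
    # One sample from g(x), assumed here to be the standard normal density.
    return rng.gauss(0.0, 1.0)


def square(x):
    # f(x) = x^2, whose integral against the standard normal is exactly 1.
    return x * x


# The estimate approaches 1 as the number of samples K grows.
for k in (100, 10_000, 200_000):
    print(k, monte_carlo_expectation(square, draw, k))
```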



Equation (14) can be viewed as a discrete Monte Carlo
approximation of the following integral (here the constant term
$\log M$ is discarded, which is irrelevant to the analysis):

$$\log \int_{\rho_1}^{\rho_2} g(\rho; \bar\rho, \sigma) \exp(-\rho)\, d\rho$$
TABLE II

Results of the Wilcoxon signed-ranks test (WSRT). The test is performed pairwise among AdaBoost, AdaBoost-CG, LPBoost and
MDBoost. A block marked "Better" indicates that the algorithm corresponding to its row is better than the
algorithm corresponding to its column, while "Not Better" suggests the contrary. Numbers in parentheses are the
numerical results of the WSRT z (the larger the better). Note that the statement "Better" only occurs where z is larger than
the critical value. The critical value depends on the number of datasets that have different classification performances.
Hence it is not a fixed number. In our case, there are two critical values: $v_1 = 61$ and $v_2 = 70$ [21]. Those marked with * should be
compared against $v_1$ and the others against $v_2$.
Better (73)
Better (78*)
Better (86)
Not Better (18)
Not Better (5)



TABLE III

Results of the Bonferroni-Dunn test (BDT). Each algorithm is compared with the other three boosting methods at the same time. A block
marked "Better" indicates that the algorithm corresponding to its row is better than the algorithm corresponding
to its column, while "Not Better" suggests the contrary. Numbers in parentheses are the numerical results of the BDT z (the
larger the better). Note that the statement "Better" only occurs where z is larger than the critical value, which is 2.291.
Fig. 3. Training error and test error of AdaBoost and MDBoost for the banana, breast-cancer and image datasets. These convergence plots correspond to the
results in Table I.
where $\mathrm{Erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x \exp(-s^2)\, ds$ is the Gauss error function.
The integral range is $[\rho_1, \rho_2]$. We do not know the
integration range explicitly, so we roughly calculate the integral
from $-\infty$ to $+\infty$. Then the last term in (16) is $\log 2$ and the
result is simple:

$$f(w) = -\bar\rho + \tfrac{1}{2}\sigma^2. \qquad (17)$$

This approximation is reasonable because Gaussian distribu- 
tions drop off quickly and Gaussian is not considered a heavy- 
tailed distribution. 
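For reference, the value of the integral over the whole real line follows from completing the square in the exponent. The steps below are a standard Gaussian identity (the moment generating function evaluated at $-1$) written in the notation of this appendix. They are offered as a sketch and do not reproduce the proof in [15].

```latex
\begin{align*}
\int_{-\infty}^{+\infty} g(\rho; \bar\rho, \sigma)\, e^{-\rho}\, d\rho
  &= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\Big(-\frac{(\rho - \bar\rho)^2}{2\sigma^2} - \rho\Big)\, d\rho \\
  &= e^{-\bar\rho + \sigma^2/2}
     \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma}
     \exp\Big(-\frac{(\rho - \bar\rho + \sigma^2)^2}{2\sigma^2}\Big)\, d\rho \\
  &= e^{-\bar\rho + \sigma^2/2}.
\end{align*}
```

Taking the logarithm gives $-\bar\rho + \tfrac{1}{2}\sigma^2$, because the remaining integrand is a Gaussian density with mean $\bar\rho - \sigma^2$ and therefore integrates to one.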

Hence, AdaBoost approximately maximizes the cost function

$$-f(w) = \bar\rho - \tfrac{1}{2}\sigma^2. \qquad (18)$$



This cost function has a clear and elegant explanation: the
first term $\bar\rho$ is the unnormalized average margin and the second
term involves $\sigma^2$, the unnormalized margin variance. AdaBoost
thus maximizes the unnormalized average margin while also
penalizing the unnormalized margin variance, which yields a
better margin distribution.
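A quick numerical check of this interpretation is sketched below, under assumptions chosen purely for illustration: margins drawn from a Gaussian with mean 1.0 and standard deviation 0.5, and two hand-picked margin sets with equal mean. The function names are hypothetical.

```python
import math
import random


def log_mean_exp_neg(margins):
    """log((1/M) * sum_i exp(-rho_i)): the AdaBoost cost minus the log M constant."""
    return math.log(sum(math.exp(-rho) for rho in margins) / len(margins))


def gaussian_surrogate(margins):
    """-mean + variance / 2, i.e. the Gaussian approximation of the cost."""
    mean = sum(margins) / len(margins)
    var = sum((rho - mean) ** 2 for rho in margins) / len(margins)
    return -mean + 0.5 * var


# For Gaussian margins the two quantities nearly coincide.
rng = random.Random(0)
margins = [rng.gauss(1.0, 0.5) for _ in range(50_000)]
print(log_mean_exp_neg(margins), gaussian_surrogate(margins))

# Same average margin, different spread: the tighter set has the lower cost.
tight = [1.0, 1.0, 1.0, 1.0]
spread = [0.0, 2.0, 0.0, 2.0]
print(log_mean_exp_neg(tight), log_mean_exp_neg(spread))
```

The second comparison uses margins that are far from Gaussian, so the surrogate is only a rough guide there, but both the exact cost and the surrogate rank the tighter margin set as better.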

Acknowledgment 

The authors thank the anonymous reviewers for their valu- 
able comments, which have significantly improved the quality 
of this paper. 

References 

[1] R. E. Schapire, Y. Freund, P. Bartlett, and W. S. Lee, "Boosting the
margin: A new explanation for the effectiveness of voting methods,"
Ann. Statist., vol. 26, no. 5, pp. 1651-1686, 1998.
[2] R. Meir and G. Rätsch, "An introduction to boosting and leveraging,"
in Advanced Lectures on Machine Learning, pp. 118-183. Springer-Verlag,
New York, NY, USA, 2003.
[3] L. Breiman, "Prediction games and arcing algorithms," Neural Comp.,
vol. 11, no. 7, pp. 1493-1517, 1999.






[4] J. Friedman, T. Hastie, and R. Tibshirani, "Rejoinder for additive logistic
regression: A statistical view of boosting," Ann. Statist., vol. 28, pp.
400-407, 2000.
[5] L. Mason, J. Baxter, P. Bartlett, and M. Frean, "Boosting algorithms
as gradient descent," in Proc. Adv. Neural Inf. Process. Syst., 2000, pp.
512-518.
[6] L. Wang, M. Sugiyama, C. Yang, Z.-H. Zhou, and J. Feng, "On the
margin explanation of boosting algorithms," in Proc. Annual Conf.
Learn. Theory, Helsinki, Finland, 2008, pp. 479-490.
[7] A. J. Grove and D. Schuurmans, "Boosting in the limit: maximizing the
margin of learned ensembles," in Proc. National Conf. Artificial Intell.,
Madison, Wisconsin, USA, 1998, pp. 692-699.
[8] G. Rätsch and M. K. Warmuth, "Efficient margin maximizing with
boosting," J. Mach. Learn. Res., vol. 6, pp. 2131-2152, 2005.
[9] M. K. Warmuth, J. Liao, and G. Rätsch, "Totally corrective boosting
algorithms that maximize the margin," in Proc. Int. Conf. Mach. Learn.,
Pittsburgh, Pennsylvania, 2006, pp. 1001-1008.
[10] A. Demiriz, K. P. Bennett, and J. Shawe-Taylor, "Linear programming
boosting via column generation," Mach. Learn., vol. 46, no. 1-3, pp.
225-254, 2002.
[11] L. Reyzin and R. E. Schapire, "How boosting the margin can also boost
classifier complexity," in Proc. Int. Conf. Mach. Learn., Pittsburgh,
Pennsylvania, USA, 2006.
[12] A. Garg and D. Roth, "Margin distribution and learning," in Proc. Int.
Conf. Mach. Learn., Washington, DC, 2003, pp. 210-217.
[13] H. Lodhi, G. J. Karakoulas, and J. Shawe-Taylor, "Boosting the margin
distribution," in Proc. Int. Conf. Intelli. Data Eng. & Automated Learn.,
Data Mining, Financial Eng., & Intelli. Agents, London, UK, 2000, pp.
54-59, Springer-Verlag.
[14] J. Shawe-Taylor and N. Cristianini, "Further results on the margin
distribution," in Proc. Annual Conf. Learn. Theory, Santa Cruz,
California, 1999, pp. 278-285.
[15] C. Shen and H. Li, "On the dual formulation of boosting algorithms,"
IEEE Trans. Pattern Anal. Mach. Intell., 2010, available at
http://arxiv.org/abs/0901.3590
[16] S. Boyd and L. Vandenberghe, Convex Optimization, Cambridge
University Press, 2004.
[17] G. B. Dantzig and P. Wolfe, "Decomposition principle for linear
programs," Operation Res., vol. 8, no. 1, pp. 101-111, 1960.
[18] M. E. Lübbecke and J. Desrosiers, "Selected topics in column
generation," Operation Res., vol. 53, no. 6, pp. 1007-1023, 2005.
[19] G. Rätsch, T. Onoda, and K.-R. Müller, "Soft margins for AdaBoost,"
Mach. Learn., vol. 42, no. 3, pp. 287-320, 2001, data sets are available
at http://theoval.cmp.uea.ac.uk/~gcc/matlab/index.shtml
[20] MOSEK ApS, "The MOSEK optimization toolbox for matlab manual,
version 5.0, revision 93," 2008, http://www.mosek.com/
[21] J. Demšar, "Statistical comparisons of classifiers over multiple data
sets," J. Mach. Learn. Res., vol. 7, pp. 1-30, December 2006.



